Sequencing and Sequence Data

Jelmer Poelstra

MCIC Wooster, OSU

2024-01-25

DNA sequencing technologies

DNA sequencing technologies: overview

  • The first generation: Sanger sequencing (since 1977)
    Sequences a single, PCR-amplified, short-ish (≤900 bp) DNA fragment at a time

Two types of high-throughput sequencing (HTS), which sequence 10^5 to 10^9 usually randomly selected DNA fragments (reads) at a time:

  • Short-read sequencing
    • Produces up to billions of ≤300 bp “reads”
    • Market dominated by Illumina
    • Since 2005 — now “stable”
    • AKA Next-generation Sequencing (NGS)
  • Long-read sequencing
    • Reads much longer than in NGS but fewer, less accurate, and more costly per base
    • Two main companies: Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio)
    • Since 2011 — remains under rapid development

Sequencing technology development timeline

https://en.wikipedia.org/wiki/DNA_sequencing#/media/File:History_of_sequencing_technology.jpg

Sequencing cost through time

Sanger sequencing

Sequencing is performed by synthesizing a new DNA strand in part with fluorescently-labeled nucleotides (one color per base).

Visualization is not done in real time, but after the fact: fragments of each possible length are produced (with fluorescent labeling only for the last base), and these can be separated by length afterwards.

The final result is a chromatogram that can be base-called:

https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/sequencing_pages/seq_troubleshooting.jsp


The entire human genome was sequenced with Sanger technology! (More on that later.)

Sanger sequencing (cont.)

Amplification of a DNA fragment can be done through bacterial cloning or PCR — these days mostly with PCR.

  • This means that you need to know in advance (at least approximately) the short sequences flanking the sequence of interest, since your PCR primers are designed in them.

  • Introns are good targets to sequence: variable sequences flanked by conserved sequences (exons) in which primers can be designed.


Common current applications of Sanger sequencing include:

  • Examining variation among individuals or populations in one or more candidate or marker genes (for population genetics, phylogenetics, functional inferences, etc.)

  • Taxonomic identification of a sample

Sequencing characteristics: read length

When are longer reads useful?

  • Genome assembly

  • Haplotyping

  • Transcript isoform identification

  • Taxonomic identification of single reads (microbial metabarcoding)

When does it not matter (as much)?

  • Read-as-a-tag: when we just need to know a read’s origin in a reference genome, like in counting applications such as RNA-seq

  • Variant analysis

What about RNA sequencing?

This lecture technically deals with DNA sequencing — however, it includes the indirect sequencing of RNA after reverse transcription to cDNA. (The direct sequencing of RNA is possible but hard and outside of the scope of this lecture.)

Sequencing characteristics: error rates

Currently, no sequencing technology is error-free, and several types of errors can occur:

  • Base call errors, e.g. a base that was called as an A may instead be a G.

  • Insertion or deletion (indel) errors

  • When the base calling software is not confident at all, it can also return Ns (= undetermined bases).

Quality scores in sequence data

When you receive sequences from a high-throughput sequencer, base calls have typically already been made. You will then receive your reads in so-called FASTQ files (more on those later) and every base in every read will be accompanied by a quality score, which is inversely related to the estimated error probability.
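
The relationship between a quality score and its error probability can be sketched in a few lines of Python. This is a minimal illustration, not a full FASTQ parser: it assumes the standard Phred scale, Q = -10 * log10(P), and the Phred+33 ASCII encoding used in current Illumina FASTQ files. The function names and the example quality string are made up for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    Assumes the Phred+33 encoding: score = ASCII code - 33.
    """
    return [ord(ch) - offset for ch in qual]


# Hypothetical quality string for a 4-base read
scores = decode_quality_string("II5#")
for q in scores:
    print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.4f}")
```

A score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000, so higher scores mean more reliable base calls.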

Overcoming sequencing errors

Coverage
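
As a rough illustration of what coverage means, the sketch below uses the common approximation that mean coverage (depth) equals the total number of sequenced bases divided by the genome size. The function names and the numbers in the example are hypothetical, and the calculation ignores real-world complications such as uneven coverage, duplicates, and unmapped reads.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate mean coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Approximate number of reads needed to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 10 million 150-bp reads for a 5 Mbp bacterial genome
print(mean_coverage(10_000_000, 150, 5_000_000))

# Hypothetical example: reads needed for 30x coverage of the same genome
print(reads_needed(30, 150, 5_000_000))
```

With higher coverage, each position in the genome is read more times, which makes it easier to recognize a random sequencing error as the odd one out among the reads covering that position.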

Distinguishing sequencing errors from biological variation

Random vs nonrandom errors

Sequencing technologies: Illumina

  • More reads, lower per-base cost, and lower error rates than long-read sequencing. The error-rate advantage is shrinking as long-read technologies keep improving (and Illumina does not).

  • Like Sanger, sequencing is done by synthesizing a new strand and using fluorescently labeled bases.

[TODO: Include Illumina]

Sequencing technologies: Illumina machines

[TODO: Table of machines]

Sequencing technologies: Illumina libraries

  • Adapters etc

  • Single-end vs. paired-end

Sequencing technologies: ONT

  • Includes a very small sequencer, the MinION

  • Continuing rapid development in technology and bioinformatics software

High-throughput sequencing applications

  • Whole-genome assembly (very high depth; best with a combination of long and short reads)

  • Variant analysis for population genetics/genomics, molecular evolution, GWAS:

    • Whole-genome resequencing

    • Reduced-representation libraries (e.g. RADseq, GBS)

  • Transcriptomics with RNA-seq

  • Other functional sequencing methods like ChIP-seq, Methyl-seq, etc

  • Microbial community characterization

    • Metabarcoding

    • Shotgun metagenomics

Sequence data files

Sequence only: FASTA

Sequence files and most other genomic data files are plain-text files. We will see a couple more formats when learning about RNA-seq next week, but today we will learn about the FASTA format.
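
Because FASTA is plain text, it can be read with very little code. The sketch below assumes the usual FASTA layout: each record starts with a header line beginning with ">", followed by one or more lines of sequence. The function name and the two example records are made up for illustration; for real work, an established library is a safer choice.

```python
from io import StringIO
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            # Sequence may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record FASTA content, held in memory instead of a file
example = StringIO(">seq1 hypothetical example record\nACGTAC\nGGTA\n>seq2\nTTGCA\n")
for name, seq in read_fasta(example):
    print(name, len(seq), seq)
```

The same function works on a real file by passing an open file handle instead of the `StringIO` object.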

Sequence reads with quality: FASTQ

Annotation: GTF/GFF

Alignments: SAM/BAM

Sequence databases

Sequence databases

  • NCBI GenBank

  • NCBI RefSeq

  • NCBI SRA

Proteins:

  • UniProt

  • Protein Data Bank (3D structures)

Genome assemblies

Finding your reference genome